😊 Hello Thursday

Author

Blair Wang

Introduction

Hello world something something

Overview

flowchart LR
  x1[📜 CSV file]
  x2[🧹 Cleanup columns]
  x3[💾 Save CSV file]
  x4[🎲 Preview random 10 rows]
  x5[🕵 Look for coinciding text values]
  x6[🧮 Aggregate data]
  x7[📊 Visualise data]
  x1 --> x2
  x2 --> x4
  x2 --> x5
  x2 --> x6
  x6 --> x7
  
  x2 --> x3

Step 1. Load the libraries

Code
library(tidyverse)
library(plotly)

Step 2. Open the CSV file

Please note that in the code, we carefully identify each of the data types.

Code
cso_dha79 <- read_csv("../../datasets/DHA79.20250129T140111.csv", col_types = list (
  `STATISTIC` = col_character(),
  `Statistic Label` = col_character(),
  `TLIST(A1)` = col_integer(),
  `Year` = col_integer(),
  `C02076V02508` = col_character(),
  `Age Group` = col_character(),
  `C02199V02655` = col_character(),
  `Sex` = col_character(),
  `C02855V03433` = col_character(),
  `ISHMT` = col_character(),
  `UNIT` = col_character(),
  `VALUE` = col_integer()
))
print(cso_dha79)
# A tibble: 70,980 × 12
   STATISTIC `Statistic Label` `TLIST(A1)`  Year C02076V02508 `Age Group` 
   <chr>     <chr>                   <int> <int> <chr>        <chr>       
 1 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 2 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 3 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 4 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 5 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 6 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 7 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 8 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
 9 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
10 DHA79C01  Inpatients               2017  2017 215          0 - 14 years
# ℹ 70,970 more rows
# ℹ 6 more variables: C02199V02655 <chr>, Sex <chr>, C02855V03433 <chr>,
#   ISHMT <chr>, UNIT <chr>, VALUE <int>

Step 3. Clean up the dataset

In this step, we:

  1. Rename the columns, so they are easier to work with programmatically;
  2. Select only the columns we need for the purpose of this analysis;
  3. Handle NA values
Code
# Rename the columns, so they are easier to work with programmatically
cso_dha79 <- cso_dha79 %>% 
  rename(px_type = "Statistic Label") %>%
  rename(calendar_year = "Year") %>%
  rename(age_group = "Age Group") %>%
  rename(medical_sex = "Sex") %>%
  rename(medical_ishmt = "ISHMT") %>%
  rename(num_cases = "VALUE")

# Select only the columns we need for the purpose of this analysis
cso_dha79_selectedcols <- cso_dha79 %>% select(px_type, calendar_year, age_group, medical_ishmt, medical_sex, num_cases)

# Handle NA values
cso_dha79_selectedcols <- cso_dha79_selectedcols %>% mutate(num_cases = replace_na(num_cases, 0))

Step 4. Handle aggregations in the datset

The dataset contains some rows which aggregate the data of other rows (!):

  • Where medical_ishmt = “All causes (A00-Z99 (excl. V, W, X & Y ))”
  • Where medical_ishmt = “All causes excluding dialysis daycases (A00-Z99 (excl. Z49.1 & Z49.2, V, W, X & Y)) ))”
  • Where medical_sex = “Both sexes”
  • Where age_group = “All ages”

These will mess up our own aggregation process, so we will need to handle them.

Note: ISHMT = International shortlist for hospital morbidity tabulation (ISHMT). For more info see this link here.

Code
cso_dha79_selectedcols <- cso_dha79_selectedcols %>%
  filter(medical_ishmt != "All causes (A00-Z99 (excl. V, W, X & Y ))") %>%
  filter(medical_ishmt != "All causes excluding dialysis daycases (A00-Z99 (excl. Z49.1 & Z49.2, V, W, X & Y)) ))") %>%
  filter(medical_sex != "Both sexes") %>%
  filter(age_group != "All ages")

Step 5. Plot the data using Plotly

gt?

Have a look at the part that starts with library(gt). This is an alternative to print(data_for_plotly).

Prepare the data for plotly:

Code
# Data for plotly
data_for_plotly <- cso_dha79_selectedcols %>%
  group_by(age_group, calendar_year) %>%
  summarize(
    total_num_cases = sum(num_cases),
    .groups = "keep"
  ) %>%
  ungroup()

library(gt)
data_for_plotly %>%
  gt() %>%
  fmt_number(
    columns = total_num_cases,
    sep_mark = ",",
    decimals = 0
  ) %>%
  opt_table_font(
    size = "9pt",
    font = list(
      "Courier",
      "monospace"
    )
  )
age_group calendar_year total_num_cases
0 - 14 years 2017 406,852
0 - 14 years 2018 414,298
0 - 14 years 2019 399,405
0 - 14 years 2020 299,725
0 - 14 years 2021 326,061
0 - 14 years 2022 364,675
0 - 14 years 2023 366,342
15 - 24 years 2017 271,048
15 - 24 years 2018 278,985
15 - 24 years 2019 276,314
15 - 24 years 2020 228,310
15 - 24 years 2021 250,518
15 - 24 years 2022 266,733
15 - 24 years 2023 278,088
25 - 34 years 2017 558,525
25 - 34 years 2018 533,437
25 - 34 years 2019 527,279
25 - 34 years 2020 451,045
25 - 34 years 2021 491,330
25 - 34 years 2022 474,808
25 - 34 years 2023 502,524
35 - 44 years 2017 682,232
35 - 44 years 2018 671,191
35 - 44 years 2019 684,879
35 - 44 years 2020 590,215
35 - 44 years 2021 643,042
35 - 44 years 2022 651,111
35 - 44 years 2023 695,418
45 - 54 years 2017 714,570
45 - 54 years 2018 717,052
45 - 54 years 2019 728,718
45 - 54 years 2020 620,874
45 - 54 years 2021 672,493
45 - 54 years 2022 719,898
45 - 54 years 2023 789,992
55 - 64 years 2017 912,387
55 - 64 years 2018 918,492
55 - 64 years 2019 950,748
55 - 64 years 2020 807,007
55 - 64 years 2021 860,533
55 - 64 years 2022 927,146
55 - 64 years 2023 1,008,080
65 - 74 years 2017 1,083,371
65 - 74 years 2018 1,108,931
65 - 74 years 2019 1,137,802
65 - 74 years 2020 976,497
65 - 74 years 2021 1,033,032
65 - 74 years 2022 1,109,468
65 - 74 years 2023 1,177,083
75 - 84 years 2017 757,974
75 - 84 years 2018 788,348
75 - 84 years 2019 824,872
75 - 84 years 2020 714,030
75 - 84 years 2021 783,437
75 - 84 years 2022 879,043
75 - 84 years 2023 952,249
85 years and over 2017 234,907
85 years and over 2018 252,414
85 years and over 2019 264,749
85 years and over 2020 228,528
85 years and over 2021 261,273
85 years and over 2022 286,837
85 years and over 2023 306,589
Code
# print(data_for_plotly)

Show the plot:

Code
plot <- data_for_plotly %>%
  plot_ly(
    x = ~total_num_cases,
    type = "box",
    name = ~age_group
  )

plot